Load the necessary libraries
library(car) #for regression diagnostics
library(broom) #for tidy output
library(ggfortify) #for model diagnostics
library(sjPlot) #for outputs
library(knitr) #for kable
library(effects) #for partial effects plots
library(emmeans) #for estimating marginal means
library(ggeffects) #for partial effects plots
library(MASS) #for glm.nb
library(MuMIn) #for AICc
library(DHARMa) #for residual diagnostics plots
library(modelr) #for auxillary modelling functions
library(performance) #for residuals diagnostics
library(see) #for plotting residuals
library(patchwork) #for grids of plots
library(tidyverse) #for data wrangling
theme_set(theme_classic())
Loyn (1987) modeled the abundance of forest birds with six predictor variables (patch area, distance to nearest patch, distance to nearest larger patch, grazing intensity, altitude and years since the patch had been isolated).
Regent honeyeater
Format of loyn.csv data file
| abund | dist | ldist | area | graze | alt | yr_isol |
|---|---|---|---|---|---|---|
| .. | .. | .. | .. | .. | .. | .. |
| abund | Abundance of forest birds in patch- response variable |
| dist | Distance to nearest patch - predictor variable |
| ldist | Distance to nearest larger patch - predictor variable |
| area | Size of the patch - predictor variable |
| graze | Grazing intensity (1 to 5, representing light to heavy) - predictor variable |
| alt | Altitude - predictor variable |
| yr_isol | Number of years since the patch was isolated - predictor variable |
The aim of the analysis is to investigate the effects of a range of predictors on the abundance of forest birds.
loyn = read_csv('../data/loyn.csv', trim_ws=TRUE) %>%
janitor::clean_names()
## Parsed with column specification:
## cols(
## ABUND = col_double(),
## AREA = col_double(),
## YR.ISOL = col_double(),
## DIST = col_double(),
## LDIST = col_double(),
## GRAZE = col_double(),
## ALT = col_double()
## )
glimpse(loyn)
## Rows: 56
## Columns: 7
## $ abund <dbl> 5.3, 2.0, 1.5, 17.1, 13.8, 14.1, 3.8, 2.2, 3.3, 3.0, 27.6, 1.…
## $ area <dbl> 0.1, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2…
## $ yr_isol <dbl> 1968, 1920, 1900, 1966, 1918, 1965, 1955, 1920, 1965, 1900, 1…
## $ dist <dbl> 39, 234, 104, 66, 246, 234, 467, 284, 156, 311, 66, 93, 39, 4…
## $ ldist <dbl> 39, 234, 311, 66, 246, 285, 467, 1829, 156, 571, 332, 93, 39,…
## $ graze <dbl> 2, 5, 5, 3, 5, 3, 5, 5, 4, 5, 3, 5, 2, 1, 5, 5, 3, 3, 3, 2, 2…
## $ alt <dbl> 160, 60, 140, 160, 140, 130, 90, 60, 130, 130, 210, 160, 210,…
Avoid multicollinearity (correlated variables CAN be included to soak up the variance but CANNOT be interpreted while other correlated variables are there!)
# pairs(loyn)
scatterplotMatrix(~abund + dist + ldist + area + graze + alt + yr_isol,
data=loyn, diagonal = list(method = 'boxplot'),
regLine = list(col="red"))
glm(abund ~ ., data = loyn) %>% vif
## area yr_isol dist ldist graze alt
## 1.337627 1.841657 1.227387 1.255028 2.307661 1.574537
Predictors must be transformed, as the family relates only to the response. Log area in particular is particularly skewed.
scatterplotMatrix(~abund + log(dist) + log(ldist) + log(area) +
graze + alt + yr_isol,
data=loyn, diagonal = list(method = 'boxplot'),
smooth=list(col.spread="green"),
regLine = list(col="red"))
glm(abund ~ log(dist) + log(ldist) + log(area) +
graze + alt + yr_isol, data = loyn) %>% vif
## log(dist) log(ldist) log(area) graze alt yr_isol
## 1.654553 2.009749 1.911514 2.524814 1.467937 1.804769
# loyn <- loyn %>%
# mutate(larea = log(area)) %>%
# dplyr::select(-area)
Graze has a high VIF, consider deleting. Also, should maybe be ordinal/categorical instead.
loyn <- loyn %>%
mutate(fgraze = as.factor(graze),
l_dist = log(dist),
l_ldist = log(ldist),
l_area = log(area),
sl_dist = scale(l_dist, scale=F),
sl_ldist = scale(l_ldist, scale=F),
sl_area = scale(l_area),
salt = scale(alt, scale=F),
syr_isol = scale(yr_isol, scale=F)
)
loyn.glm <- glm(abund ~ sl_dist + sl_ldist + sl_area +
fgraze + salt + syr_isol,
data = loyn,
family = gaussian(link = "identity"))
vif(loyn.glm) # none above 3
## GVIF Df GVIF^(1/(2*Df))
## sl_dist 1.907874 1 1.381258
## sl_ldist 2.228456 1 1.492801
## sl_area 2.201116 1 1.483616
## fgraze 7.925036 4 1.295314
## salt 1.597400 1 1.263883
## syr_isol 3.252591 1 1.803494
The rule is we don’t want values above 3 for VIF, otherwise, that variable is correlated to the others in a way that obscures its interpretation.
There’s also a function to easily compare scaled effect sizes of all variables rather than do so here. That way, we can still interpret the function with the appropriate units here, but also check effect sizes later on as well.
Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ log(\mu_i) = \boldsymbol{\beta} \bf{X_i} \]
where \(\boldsymbol{\beta}\) is a vector of effects parameters and \(\bf{X}\) is a model matrix representing the additive effects of the scaled versions of distance (ln), distance to the nearest large patch (ln), patch area (ln), grazing intensity, year of isolation and altitude on the abundance of forest birds.
autoplot(loyn.glm, which = 1:6)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Residuals are ok, normality not great, but should be robust to this. Cook’s d are all low (less than 0.8).
loyn.resid <- simulateResiduals(loyn.glm, plot=T)
Homoscedasticity is ok too.
When you have multiple variables, need to also plot the residuals against each predictor variable too.
loyn$resid <- loyn.resid$scaledResiduals
loyn %>%
ggplot(aes(x = l_dist, y = resid)) +
geom_point() +
geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
loyn %>%
ggplot(aes(x = l_ldist, y = resid)) +
geom_point() +
geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
loyn %>%
ggplot(aes(x = l_area, y = resid)) +
geom_point() +
geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
loyn %>%
ggplot(aes(x = alt, y = resid)) +
geom_point() +
geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
loyn %>%
ggplot(aes(x = yr_isol, y = resid)) +
geom_point() +
geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Didn't have enough time to finish this last plot
# loyn %>%
#
# group_by(fgraze) %>%
# summarise(mean = mean(resid),
# se = sd(resid) / sqrt( n() ),
# upr = gmodels::ci(resid)[3]) %>%
# ggplot(aes(x = fgraze, y = mean)) +
# geom_bar() +
# geom_pointrange(aes(ymin = lwr, ymax = upr))
loyn.glm %>% ggemmeans(~sl_dist) %>% plot(add=T)
loyn.glm %>% ggemmeans(~sl_ldist) %>% plot(add=T)
loyn.glm %>% ggemmeans(~salt) %>% plot(add=T)
loyn.glm %>% ggemmeans(~syr_isol) %>% plot(add=T)
loyn.glm %>% ggemmeans(~sl_area + fgraze) %>% plot(add=T)
plot_model(loyn.glm, type = 'eff', show.data = T, dot.size = 0.5) %>%
plot_grid
## Warning in plot_grid(.): Not enough tags labels in list. Using letters instead.
Note that if we had used log() and scale() in the glm function, plot_model would do the appropriate back-transform to the original scale, amazingly!
log-area is clearly important, as well as graze = 5.
plot(allEffects(loyn.glm, residuals = T), type = "response")
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors
plot_model(loyn.glm, type='est')
plot_model(loyn.glm, type='est', transform='exp', show.values=TRUE)
summary(loyn.glm)
##
## Call:
## glm(formula = abund ~ sl_dist + sl_ldist + sl_area + fgraze +
## salt + syr_isol, family = gaussian(link = "identity"), data = loyn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -15.8992 -2.7245 -0.2772 2.7052 11.2811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.47274 2.35785 9.531 1.83e-12 ***
## sl_dist 0.14456 1.19334 0.121 0.9041
## sl_ldist 0.34641 0.92835 0.373 0.7107
## sl_area 5.55123 1.22130 4.545 3.97e-05 ***
## fgraze2 0.52851 3.25221 0.163 0.8716
## fgraze3 0.06601 2.95871 0.022 0.9823
## fgraze4 -1.24877 3.19838 -0.390 0.6980
## fgraze5 -12.47309 4.77827 -2.610 0.0122 *
## salt 0.01070 0.02390 0.448 0.6565
## syr_isol -0.01277 0.05803 -0.220 0.8267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37.27021)
##
## Null deviance: 6337.9 on 55 degrees of freedom
## Residual deviance: 1714.4 on 46 degrees of freedom
## AIC: 372.52
##
## Number of Fisher Scoring iterations: 2
Evidence of an effect of log-area and of grazing area 5 being less than grazing 1, while all other grazing levels were not different from grazing 1. Need to investigate pair-wise effects among the factor levels thereafter.